Feature Extraction for Massive Data Mining
Abstract
Techniques for learning from data typically require data to be in standard form. Measurements must be encoded in a numerical format such as binary true-or-false features, numerical features, or possibly numerical codes. In addition, for classification, a clear goal for learning must be specified. While some databases may readily be arranged in standard form, many others may be combinations of numerical fields or text, with thousands of possibilities for each data field, and multiple instances of the same field specification. A significant portion of the effort in real-world data mining applications involves defining, identifying and encoding the data into suitable features. In this paper, we describe an automatic feature extraction procedure, adapted from modern text categorization techniques, that maps very large databases into manageable datasets in standard form. We describe a commercial application of this procedure to mining a collection of very large databases of home appliance service records for a major international retailer.
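The abstract does not spell out the extraction procedure, so the following is only a minimal sketch of the general idea it alludes to: a dictionary-based encoding in the style of text categorization, where the most frequent (field, value) pairs across all records become binary true-or-false features. The function names `build_dictionary` and `encode`, the `max_features` parameter, and the sample service records are all hypothetical and are not taken from the paper.

```python
from collections import Counter


def build_dictionary(records, max_features=3):
    """Select the most frequent (field, value) pairs as the feature dictionary.

    Assumption: frequency-based selection stands in for whatever
    selection criterion the paper actually uses.
    """
    counts = Counter()
    for record in records:
        # Count each pair at most once per record (document frequency),
        # so repeated instances of a field in one record do not inflate it.
        counts.update(set(record))
    # Sort by descending frequency, then by pair, for a deterministic order.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [pair for pair, _ in ranked[:max_features]]


def encode(record, dictionary):
    """Map one record to a fixed-length binary feature vector."""
    present = set(record)
    return [1 if pair in present else 0 for pair in dictionary]


# Hypothetical service records: a field such as "part" may occur several
# times in the same record.
records = [
    [("product", "washer"), ("part", "pump"), ("part", "belt")],
    [("product", "washer"), ("part", "pump")],
    [("product", "dryer"), ("part", "belt")],
]

dictionary = build_dictionary(records, max_features=3)
matrix = [encode(r, dictionary) for r in records]
```

Each row of `matrix` has the same length regardless of how many fields the original record contained, which is what allows a standard learning method to consume it.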
Similar resources
Overlap-based feature weighting: The feature extraction of Hyperspectral remote sensing imagery
Hyperspectral sensors provide a large number of spectral bands. This massive and complex data structure of hyperspectral images presents a challenge to traditional data processing techniques. Therefore, reducing the dimensionality of hyperspectral images without losing important information is a very important issue for the remote sensing community. We propose to use overlap-based feature weigh...
Feature extraction in opinion mining through Persian reviews
Opinion mining deals with an analysis of user reviews for extracting their opinions, sentiments and demands in a specific area, which can play an important role in making major decisions in such area. In general, opinion mining extracts user reviews at three levels of document, sentence and feature. Opinion mining at the feature level is taken into consideration more than the other two levels d...
A Geometric View of Similarity Measures in Data Mining
The main objective of data mining is to acquire information from a set of data for prospect applications using a measure. The concerning issue is that one often has to deal with large scale data. Several dimensionality reduction techniques like various feature extraction methods have been developed to resolve the issue. However, the geometric view of the applied measure, as an additional consid...
Feature extraction of hyperspectral images using boundary semi-labeled samples and hybrid criterion
Feature extraction is a very important preprocessing step for classification of hyperspectral images. The linear discriminant analysis (LDA) method fails to work in small sample size situations. Moreover, LDA has poor efficiency for non-Gaussian data. LDA is optimized by a global criterion. Thus, it is not sufficiently flexible to cope with the multi-modal distributed data. We propose a new fea...
Transaction Encoding Algorithm (TEA) for Distributed Data
Analysis of huge datasets has been a major concern in almost all areas of technology in the past decade and the role of data mining has become so crucial as a result of this crisis. As the data sizes in these datasets increase, from gigabytes to terabytes or even larger the complexity in collecting and warehousing these massive dataset as such in a single site is practically impossible as it ma...
Enhanced Discoverability of Content through Linked Data for Online Reviews using Classification and Ranking Techniques
Massive unstructured data are available and being posted in numerous blogs, forums, and online sites. This enormous amount of information on worldwide network platforms make them feasible and can be used as input source, in applications based on opinion mining and sentiment analysis. The aim of this paper is to analyze online reviews in unstructured form and discover content through linked data...
Publication date: 1995